{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Categorical Variables and One Hot Encoding</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>town</th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>2600</td>\n",
       "      <td>550000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>3000</td>\n",
       "      <td>565000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>3200</td>\n",
       "      <td>610000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>3600</td>\n",
       "      <td>680000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>4000</td>\n",
       "      <td>725000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>2600</td>\n",
       "      <td>585000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>2800</td>\n",
       "      <td>615000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>3300</td>\n",
       "      <td>650000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>3600</td>\n",
       "      <td>710000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>2600</td>\n",
       "      <td>575000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>2900</td>\n",
       "      <td>600000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>3100</td>\n",
       "      <td>620000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>3600</td>\n",
       "      <td>695000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               town  area   price\n",
       "0   monroe township  2600  550000\n",
       "1   monroe township  3000  565000\n",
       "2   monroe township  3200  610000\n",
       "3   monroe township  3600  680000\n",
       "4   monroe township  4000  725000\n",
       "5      west windsor  2600  585000\n",
       "6      west windsor  2800  615000\n",
       "7      west windsor  3300  650000\n",
       "8      west windsor  3600  710000\n",
       "9       robinsville  2600  575000\n",
       "10      robinsville  2900  600000\n",
       "11      robinsville  3100  620000\n",
       "12      robinsville  3600  695000"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"homeprices.csv\")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 style='color:purple'>Using pandas to create dummy variables</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>monroe township</th>\n",
       "      <th>robinsville</th>\n",
       "      <th>west windsor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    monroe township  robinsville  west windsor\n",
       "0                 1            0             0\n",
       "1                 1            0             0\n",
       "2                 1            0             0\n",
       "3                 1            0             0\n",
       "4                 1            0             0\n",
       "5                 0            0             1\n",
       "6                 0            0             1\n",
       "7                 0            0             1\n",
       "8                 0            0             1\n",
       "9                 0            1             0\n",
       "10                0            1             0\n",
       "11                0            1             0\n",
       "12                0            1             0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dummies = pd.get_dummies(df.town)\n",
    "dummies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>town</th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "      <th>monroe township</th>\n",
       "      <th>robinsville</th>\n",
       "      <th>west windsor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>2600</td>\n",
       "      <td>550000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>3000</td>\n",
       "      <td>565000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>3200</td>\n",
       "      <td>610000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>3600</td>\n",
       "      <td>680000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monroe township</td>\n",
       "      <td>4000</td>\n",
       "      <td>725000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>2600</td>\n",
       "      <td>585000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>2800</td>\n",
       "      <td>615000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>3300</td>\n",
       "      <td>650000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>west windsor</td>\n",
       "      <td>3600</td>\n",
       "      <td>710000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>2600</td>\n",
       "      <td>575000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>2900</td>\n",
       "      <td>600000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>3100</td>\n",
       "      <td>620000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>robinsville</td>\n",
       "      <td>3600</td>\n",
       "      <td>695000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               town  area   price  monroe township  robinsville  west windsor\n",
       "0   monroe township  2600  550000                1            0             0\n",
       "1   monroe township  3000  565000                1            0             0\n",
       "2   monroe township  3200  610000                1            0             0\n",
       "3   monroe township  3600  680000                1            0             0\n",
       "4   monroe township  4000  725000                1            0             0\n",
       "5      west windsor  2600  585000                0            0             1\n",
       "6      west windsor  2800  615000                0            0             1\n",
       "7      west windsor  3300  650000                0            0             1\n",
       "8      west windsor  3600  710000                0            0             1\n",
       "9       robinsville  2600  575000                0            1             0\n",
       "10      robinsville  2900  600000                0            1             0\n",
       "11      robinsville  3100  620000                0            1             0\n",
       "12      robinsville  3600  695000                0            1             0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dummies= pd.concat([df,dummies],axis='columns')\n",
    "df_dummies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "      <th>monroe township</th>\n",
       "      <th>robinsville</th>\n",
       "      <th>west windsor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2600</td>\n",
       "      <td>550000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3000</td>\n",
       "      <td>565000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3200</td>\n",
       "      <td>610000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3600</td>\n",
       "      <td>680000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4000</td>\n",
       "      <td>725000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2600</td>\n",
       "      <td>585000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2800</td>\n",
       "      <td>615000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3300</td>\n",
       "      <td>650000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>3600</td>\n",
       "      <td>710000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2600</td>\n",
       "      <td>575000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2900</td>\n",
       "      <td>600000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>3100</td>\n",
       "      <td>620000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>3600</td>\n",
       "      <td>695000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    area   price  monroe township  robinsville  west windsor\n",
       "0   2600  550000                1            0             0\n",
       "1   3000  565000                1            0             0\n",
       "2   3200  610000                1            0             0\n",
       "3   3600  680000                1            0             0\n",
       "4   4000  725000                1            0             0\n",
       "5   2600  585000                0            0             1\n",
       "6   2800  615000                0            0             1\n",
       "7   3300  650000                0            0             1\n",
       "8   3600  710000                0            0             1\n",
       "9   2600  575000                0            1             0\n",
       "10  2900  600000                0            1             0\n",
       "11  3100  620000                0            1             0\n",
       "12  3600  695000                0            1             0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dummies.drop('town',axis='columns',inplace=True)\n",
    "df_dummies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style='color:purple'>Dummy Variable Trap</h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you can derive one variable from other variables, they are known to be multi-colinear. Here\n",
    "if you know values of california and georgia then you can easily infer value of new jersey state, i.e. \n",
    "california=0 and georgia=0. There for these state variables are called to be multi-colinear. In this\n",
    "situation linear regression won't work as expected. Hence you need to drop one column. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE: sklearn library takes care of dummy variable trap hence even if you don't drop one of the \n",
    "    state columns it is going to work, however we should make a habit of taking care of dummy variable\n",
    "    trap ourselves just in case library that you are using is not handling this for you**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "      <th>monroe township</th>\n",
       "      <th>robinsville</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2600</td>\n",
       "      <td>550000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3000</td>\n",
       "      <td>565000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3200</td>\n",
       "      <td>610000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3600</td>\n",
       "      <td>680000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4000</td>\n",
       "      <td>725000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2600</td>\n",
       "      <td>585000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2800</td>\n",
       "      <td>615000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3300</td>\n",
       "      <td>650000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>3600</td>\n",
       "      <td>710000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2600</td>\n",
       "      <td>575000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2900</td>\n",
       "      <td>600000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>3100</td>\n",
       "      <td>620000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>3600</td>\n",
       "      <td>695000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    area   price  monroe township  robinsville\n",
       "0   2600  550000                1            0\n",
       "1   3000  565000                1            0\n",
       "2   3200  610000                1            0\n",
       "3   3600  680000                1            0\n",
       "4   4000  725000                1            0\n",
       "5   2600  585000                0            0\n",
       "6   2800  615000                0            0\n",
       "7   3300  650000                0            0\n",
       "8   3600  710000                0            0\n",
       "9   2600  575000                0            1\n",
       "10  2900  600000                0            1\n",
       "11  3100  620000                0            1\n",
       "12  3600  695000                0            1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dummies.drop('west windsor',axis='columns',inplace=True)\n",
    "df_dummies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = df_dummies.drop('price',axis='columns')\n",
    "y = df_dummies.price\n",
    "\n",
    "from sklearn.linear_model import LinearRegression\n",
    "model = LinearRegression()\n",
    "model.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>area</th>\n",
       "      <th>monroe township</th>\n",
       "      <th>robinsville</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2600</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3200</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3600</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2600</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2800</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3300</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>3600</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2600</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2900</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>3100</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>3600</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    area  monroe township  robinsville\n",
       "0   2600                1            0\n",
       "1   3000                1            0\n",
       "2   3200                1            0\n",
       "3   3600                1            0\n",
       "4   4000                1            0\n",
       "5   2600                0            0\n",
       "6   2800                0            0\n",
       "7   3300                0            0\n",
       "8   3600                0            0\n",
       "9   2600                0            1\n",
       "10  2900                0            1\n",
       "11  3100                0            1\n",
       "12  3600                0            1"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 539709.7398409 ,  590468.71640508,  615848.20468716,\n",
       "        666607.18125134,  717366.15781551,  579723.71533005,\n",
       "        605103.20361213,  668551.92431735,  706621.15674048,\n",
       "        565396.15136531,  603465.38378844,  628844.87207052,\n",
       "        692293.59277574])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict(X) # 2600 sqr ft home in new jersey"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.95739290372218733"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 681241.66845839])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict([[3400,0,0]]) # 3400 sqr ft home in west windsor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 590775.63964739])"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict([[2800,0,1]]) # 2800 sqr ft home in robbinsville"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 style='color:purple'>Using sklearn OneHotEncoder</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First step is to use label encoder to convert town names into numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "le = LabelEncoder()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>town</th>\n",
       "      <th>area</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2600</td>\n",
       "      <td>550000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>3000</td>\n",
       "      <td>565000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>3200</td>\n",
       "      <td>610000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>3600</td>\n",
       "      <td>680000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>4000</td>\n",
       "      <td>725000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "      <td>2600</td>\n",
       "      <td>585000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2</td>\n",
       "      <td>2800</td>\n",
       "      <td>615000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2</td>\n",
       "      <td>3300</td>\n",
       "      <td>650000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2</td>\n",
       "      <td>3600</td>\n",
       "      <td>710000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1</td>\n",
       "      <td>2600</td>\n",
       "      <td>575000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>1</td>\n",
       "      <td>2900</td>\n",
       "      <td>600000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>1</td>\n",
       "      <td>3100</td>\n",
       "      <td>620000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1</td>\n",
       "      <td>3600</td>\n",
       "      <td>695000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    town  area   price\n",
       "0      0  2600  550000\n",
       "1      0  3000  565000\n",
       "2      0  3200  610000\n",
       "3      0  3600  680000\n",
       "4      0  4000  725000\n",
       "5      2  2600  585000\n",
       "6      2  2800  615000\n",
       "7      2  3300  650000\n",
       "8      2  3600  710000\n",
       "9      1  2600  575000\n",
       "10     1  2900  600000\n",
       "11     1  3100  620000\n",
       "12     1  3600  695000"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfle = df\n",
    "dfle.town = le.fit_transform(dfle.town)\n",
    "dfle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = dfle[['town','area']].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[   0, 2600],\n",
       "       [   0, 3000],\n",
       "       [   0, 3200],\n",
       "       [   0, 3600],\n",
       "       [   0, 4000],\n",
       "       [   2, 2600],\n",
       "       [   2, 2800],\n",
       "       [   2, 3300],\n",
       "       [   2, 3600],\n",
       "       [   1, 2600],\n",
       "       [   1, 2900],\n",
       "       [   1, 3100],\n",
       "       [   1, 3600]], dtype=int64)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,\n",
       "       710000, 575000, 600000, 620000, 695000], dtype=int64)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = dfle.price.values\n",
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now use one hot encoder to create dummy variables for each of the town"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "ohe = OneHotEncoder(categorical_features=[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,\n",
       "          2.60000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,\n",
       "          3.00000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,\n",
       "          3.20000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,\n",
       "          3.60000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,\n",
       "          4.00000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,\n",
       "          2.60000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,\n",
       "          2.80000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,\n",
       "          3.30000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,\n",
       "          3.60000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,\n",
       "          2.60000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,\n",
       "          2.90000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,\n",
       "          3.10000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,\n",
       "          3.60000000e+03]])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = ohe.fit_transform(X).toarray()\n",
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = X[:,1:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  0.00000000e+00,   0.00000000e+00,   2.60000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   3.00000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   3.20000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   3.60000000e+03],\n",
       "       [  0.00000000e+00,   0.00000000e+00,   4.00000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   2.60000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   2.80000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   3.30000000e+03],\n",
       "       [  0.00000000e+00,   1.00000000e+00,   3.60000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   2.60000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   2.90000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   3.10000000e+03],\n",
       "       [  1.00000000e+00,   0.00000000e+00,   3.60000000e+03]])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(X,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 681241.6684584])"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict([[0,1,3400]]) # 3400 sqr ft home in west windsor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 590775.63964739])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict([[1,0,2800]]) # 2800 sqr ft home in robbinsville"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2 style='color:green'>Exercise</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the same level as this notebook on github, there is an Exercise folder that contains carprices.csv.\n",
    "This file has car sell prices for 3 different models. First plot data points on a scatter plot chart\n",
    "to see if linear regression model can be applied. If yes, then build a model that can answer\n",
    "following questions,\n",
    "\n",
    "**1) Predict price of a mercedez benz that is 4 yr old with mileage 45000**\n",
    "\n",
    "**2) Predict price of a BMW X5 that is 7 yr old with mileage 86000**\n",
    "\n",
    "**3) Tell me the score (accuracy) of your model. (Hint: use LinearRegression().score())**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
